In [ ]:
%%HTML
<style>
.container { width:100% } 
</style>

Logistic Regression


In [ ]:
import numpy as np

We need to define the sigmoid function $S(t) := \large \frac{1}{1 + \exp(-t)}$.


In [ ]:
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

Since we use NumPy to compute $\exp(-t)$, we can pass a NumPy array to this function to compute the sigmoid function for every element of the array:


In [ ]:
sigmoid(np.array([-1.0, 0.0, 1.0]))

Next, we define the natural logarithm of the sigmoid function. If we implement it naively as log(sigmoid(t)), we run into overflow issues for large negative values of $t$: for $t \leq -1000$, for example, the expression np.exp(-t) overflows.


In [ ]:
np.exp(1000)

Let us compute $$ \ln\bigl(S(-1000)\bigr) = \ln\Bigl(\frac{1}{1 + \exp(1000)}\Bigr) = - \ln\bigl(1 + \exp(1000)\bigr) \approx -1000. $$


In [ ]:
-np.log(1 + np.exp(1000))

This is not what we expected: since np.exp(1000) overflows to inf, the result is $-\infty$ rather than $-1000$. Note, however, that $\exp(100)$ is already astronomically large:


In [ ]:
np.exp(100)

For $t < -100$, on the other hand, $\exp(-t)$ is larger than $\exp(100)$, so $1 + \exp(-t) \approx \exp(-t)$; in floating point arithmetic the two values are in fact identical:


In [ ]:
1 + np.exp(-(-100)) == np.exp(-(-100))

Therefore, if $t < -100$ we have: $$ \begin{array}{lcl} \ln\left(\large\frac{1}{1+\exp(-t)}\right) & = & -\ln\bigl(1+\exp(-t)\bigr) \\ & \approx & -\ln\bigl(\exp(-t)\bigr) \\ & = & t \end{array} $$ Hence $\ln\bigl(S(t)\bigr) \approx t$ for $t < -100$. The following implementation uses this approximation.


In [ ]:
def logSigmoid(t):
    if t > -100:
        return -np.log(1.0 + np.exp(-t))
    else:
        return t
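
As a quick sanity check (this cell is an addition, not needed for the rest of the notebook), we compare logSigmoid with the naive expression np.log(sigmoid(t)) for a moderately negative argument, where both are accurate, and evaluate it for a very negative argument, where the naive expression would return $-\infty$:


In [ ]:
print(logSigmoid(-50), np.log(sigmoid(-50)))  # both are approximately -50
print(logSigmoid(-2000))                      # the naive expression would give -inf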

The function $\texttt{ll}(\textbf{X}, \textbf{y}, \textbf{w})$ is mathematically defined as follows: $$\ell\ell(\mathbf{X},\mathbf{y},\mathbf{w}) = \sum\limits_{i=1}^N \ln\Bigl(S\bigl(y_i \cdot(\mathbf{x}_i \cdot \mathbf{w})\bigr)\Bigr) = \sum\limits_{i=1}^N L\bigl(y_i \cdot(\mathbf{x}_i \cdot \mathbf{w})\bigr), $$ where $L := \ln \circ S$ is the function implemented above as logSigmoid. The arguments $\textbf{X}$, $\textbf{y}$, and $\textbf{w}$ are interpreted as follows:

  • $\textbf{X}$ is the feature matrix, $\textbf{X}[i]$ is the $i$-th feature vector, i.e. we have $\textbf{X}[i] = \textbf{x}_i$ if we regard $\textbf{x}_i$ as a row vector.

    It is assumed that $\textbf{X}[i][0]$ is 1.0 for all $i$.

  • $\textbf{y}$ is the output vector, $\textbf{y}[i] \in \{-1,+1\}$ for all $i$.
  • $\textbf{w}$ is the weight vector.
$\texttt{ll}(\textbf{X}, \textbf{y},\textbf{w})$ computes the log-likelihood of the weight vector $\textbf{w}$ given the observations $\textbf{X}$ and $\textbf{y}$.


In [ ]:
def ll(X, y, w):   
    return np.sum([logSigmoid(y[i] * (X[i] @ w)) for i in range(len(X))])
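
Since logSigmoid contains a scalar if, the function ll loops over the rows of X in Python. Purely as an optional sketch (the name ll_vec is not used elsewhere in this notebook), a vectorized variant can use np.logaddexp, which computes $\ln(e^a + e^b)$ in a numerically stable way; note that $\ln\bigl(S(t)\bigr) = -\ln\bigl(1 + e^{-t}\bigr) = -\mathtt{logaddexp}(0, -t)$.


In [ ]:
def ll_vec(X, y, w):
    # ln(S(t)) = -logaddexp(0, -t), evaluated for all rows of X at once
    t = y * (X @ w)
    return np.sum(-np.logaddexp(0.0, -t))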

The function $\mathtt{gradLL}(\mathbf{X}, \mathbf{y}, \mathbf{w})$ computes the gradient of the log-likelihood according to the formula $$ \frac{\partial\quad}{\partial\, w_j}\ell\ell(\mathbf{X},\mathbf{y},\mathbf{w}) = \sum\limits_{i=1}^N y_i \cdot x_{i,j} \cdot S(-y_i \cdot \mathbf{x}_i \cdot \mathbf{w}). $$ The components of this gradient are collected into a vector. The arguments are the same as the arguments of the function ll that computes the log-likelihood, i.e.

  • $\textbf{X}$ is the feature matrix, $\textbf{X}[i]$ is the $i$-th feature vector, i.e. we have $\textbf{X}[i] = \textbf{x}_i$.
  • $\textbf{y}$ is the output vector, $\textbf{y}[i] \in \{-1,+1\}$ for all $i$.
  • $\textbf{w}$ is the weight vector.

In [ ]:
def gradLL(X, y, w):
    Gradient = []
    for j in range(len(X[0])):  # one partial derivative for every weight w[j]
        L = [y[i] * X[i][j] * sigmoid(-y[i] * (X[i] @ w)) for i in range(len(X))]
        Gradient.append(sum(L))
    return np.array(Gradient)
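
To gain confidence in the gradient formula, we can compare gradLL with a central finite difference approximation of the partial derivatives of ll. The helper approxGrad and the tiny data set Xs, ys, ws below are additions used only for this check:


In [ ]:
def approxGrad(f, w, h=1e-6):
    # approximate each partial derivative by a central finite difference
    grad = np.zeros(len(w))
    for j in range(len(w)):
        e       = np.zeros(len(w))
        e[j]    = h
        grad[j] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

Xs = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])  # tiny synthetic feature matrix
ys = np.array([ 1.0, -1.0,  1.0])                     # labels in {-1, +1}
ws = np.array([ 0.1, -0.2])                           # arbitrary weight vector
gradLL(Xs, ys, ws), approxGrad(lambda w: ll(Xs, ys, w), ws)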

The data we want to investigate is stored in the file 'exam.csv'. The first column of this file is an integer from the set $\{0,1\}$: the number is $0$ if the corresponding student has failed the exam and $1$ otherwise. The second column is a floating point number giving the number of hours that the student has studied.


In [ ]:
import csv

In [ ]:
with open('exam.csv') as file:
    reader = csv.reader(file, delimiter=',')
    count  = 0  # line count
    Pass   = []
    Hours  = []
    for row in reader:
        if count != 0:  # skip header
            Pass .append(float(row[0]))
            Hours.append(float(row[1]))
        count += 1

To proceed, we plot the data points. To this end, we transform the lists Pass and Hours into NumPy arrays.


In [ ]:
y = np.array(Pass)
x = np.array(Hours)

In [ ]:
x

In [ ]:
import matplotlib.pyplot as plt
import seaborn           as sns

In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('Pass/Fail vs. Hours of Study')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('Hours of Study')
plt.ylabel('Pass = 1, Fail = 0')
plt.yticks(np.arange(0.0, 1.01, step=0.1))
plt.scatter(x, y, color='b')

The number of students is stored in the variable n.


In [ ]:
n = len(y)
n

We have to turn the vector x into the feature matrix X.


In [ ]:
x.shape

In [ ]:
X = np.reshape(x, (n, 1))
X

We prepend the number $1.0$ to every row of X.


In [ ]:
X = np.append(np.ones((n, 1)), X, axis=-1)
X

Currently, the entries in the vector y are either $0$ or $1$. These values need to be transformed to $-1$ and $+1$.


In [ ]:
y = 2 * y - 1
y

As we have no real clue about the weights, we set them to $0$ initially.


In [ ]:
import gradient_ascent
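
The module gradient_ascent is provided separately and its source is not shown in this notebook. Judging from the call below, findMaximum(f, gradF, start, eps, verbose) maximizes the function f starting at start, using the gradient gradF, and returns a triple whose first component is the maximizing argument. In case the module is not at hand, a minimal sketch of such a function might look as follows; the adaptive step size, the remaining return values, and the verbose flag are assumptions, not the actual implementation:


In [ ]:
def findMaximumSketch(f, gradF, start, eps, verbose=False):
    """Plain gradient ascent with a crude adaptive step size (a sketch only)."""
    w     = np.array(start, dtype=float)
    alpha = 1.0                       # initial step size (assumption)
    fOld  = f(w)
    for k in range(100000):
        wNew = w + alpha * gradF(w)
        fNew = f(wNew)
        if fNew > fOld:               # step accepted, try a larger step next time
            if abs(fNew - fOld) < eps:
                return wNew, k, fNew
            w, fOld, alpha = wNew, fNew, alpha * 1.2
        else:                         # overshot the maximum, shrink the step
            alpha *= 0.5
        if verbose and k % 1000 == 0:
            print(f'iteration {k}: ll = {fOld}')
    return w, k, fOld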

In [ ]:
start   = np.zeros((2,))
eps     = 10 ** -8
f       = lambda w: ll(X, y, w)
gradF   = lambda w: gradLL(X, y, w)
w, _, _ = gradient_ascent.findMaximum(f, gradF, start, eps, True)
beta    = w[0]
gamma   = w[1]
print(f'model: P(pass|hours) = S({beta} + {gamma} * hours)')
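
With the fitted weights we can compute the predicted probability of passing for any number of study hours. The decision boundary, i.e. the number of hours for which the predicted probability is exactly $\frac{1}{2}$, is the value of hours satisfying $\beta + \gamma \cdot \textrm{hours} = 0$, i.e. $\textrm{hours} = -\beta / \gamma$. (The hour values below are chosen only for illustration.)


In [ ]:
print(f'P(pass | 2 hours of study) = {sigmoid(beta + gamma * 2.0)}')
print(f'P(pass | 4 hours of study) = {sigmoid(beta + gamma * 4.0)}')
print(f'decision boundary: {-beta / gamma} hours of study')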

Let us plot this function together with the data.


In [ ]:
plt.figure(figsize=(15, 9))
sns.set(style='darkgrid')
plt.title('Pass/Fail vs. Hours of Study')
H = np.arange(0.0, 6.0, 0.05)
P = sigmoid(beta + gamma * H)
sns.lineplot(x=H, y=P, color='r')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('Hours of Study')
plt.ylabel('Probability of Passing the Exam')
plt.yticks(np.arange(0.0, 1.01, step=0.1))
plt.scatter(x, (y + 1) / 2, color='b')
plt.savefig('exam-probability.pdf')
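
As a final check (an additional cell, not needed for the plot above), we can see how well the model classifies the training data itself: a student is predicted to pass if the predicted probability is at least $\frac{1}{2}$, i.e. if $\mathbf{x}_i \cdot \mathbf{w} \geq 0$.


In [ ]:
predictions = np.where(X @ w >= 0, 1, -1)  # predicted labels in {-1, +1}
accuracy    = np.mean(predictions == y)
print(f'accuracy on the training data: {accuracy}')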
